34 research outputs found
Information-theoretic causal inference of lexical flow
This volume seeks to infer large phylogenetic networks from phonetically encoded lexical data and contribute in this way to the historical study of language varieties. The technical step that enables progress in this case is the use of causal inference algorithms. Sample sets of words from language varieties are preprocessed into automatically inferred cognate sets, and then modeled as information-theoretic variables based on an intuitive measure of cognate overlap. Causal inference is then applied to these variables in order to determine the existence and direction of influence among the varieties. The directed arcs in the resulting graph structures can be interpreted as reflecting the existence and directionality of lexical flow, a unified model which subsumes inheritance and borrowing as the two main ways of transmission that shape the basic lexicon of languages. A flow-based separation criterion and domain-specific directionality detection criteria are developed to make existing causal inference algorithms more robust against imperfect cognacy data, giving rise to two new algorithms. The Phylogenetic Lexical Flow Inference (PLFI) algorithm requires lexical features of proto-languages to be reconstructed in advance, but yields fully general phylogenetic networks, whereas the more complex Contact Lexical Flow Inference (CLFI) algorithm treats proto-languages as hidden common causes, and only returns hypotheses of historical contact situations between attested languages. The algorithms are evaluated both against a large lexical database of Northern Eurasia spanning many language families, and against simulated data generated by a new model of language contact that builds on the opening and closing of directional contact channels as primary evolutionary events. The algorithms are found to infer the existence of contacts very reliably, whereas the inference of directionality remains difficult. This currently limits the new algorithms to a role as exploratory tools for quickly detecting salient patterns in large lexical datasets, but it should soon be possible for the framework to be enhanced e.g. by confidence values for each directionality decision
Information-theoretic causal inference of lexical flow
This volume seeks to infer large phylogenetic networks from phonetically encoded lexical data and contribute in this way to the historical study of language varieties. The technical step that enables progress in this case is the use of causal inference algorithms. Sample sets of words from language varieties are preprocessed into automatically inferred cognate sets, and then modeled as information-theoretic variables based on an intuitive measure of cognate overlap. Causal inference is then applied to these variables in order to determine the existence and direction of influence among the varieties. The directed arcs in the resulting graph structures can be interpreted as reflecting the existence and directionality of lexical flow, a unified model which subsumes inheritance and borrowing as the two main ways of transmission that shape the basic lexicon of languages
Developing a TT-MCTAG for German with an RCG-based parser
Developing linguistic resources, in particular grammars, is known to be a complex task in itself, because of (amongst others) redundancy and consistency issues. Furthermore some languages can reveal themselves hard to describe because of specific characteristics, e.g. the free word order in German. In this context, we present (i) a framework allowing to describe tree-based grammars, and (ii) an actual fragment of a core multicomponent tree-adjoining grammar with tree tuples (TT-MCTAG) for German developed using this framework. This framework combines a metagrammar compiler and a parser based on range concatenation grammar (RCG) to respectively check the consistency and the correction of the grammar. The German grammar being developed within this framework already deals with a wide range of scrambling and extraction phenomena
TuLiPA : a syntax-semantics parsing environment for mildly context-sensitive formalisms
In this paper we present a parsing architecture that allows processing of different mildly context-sensitive formalisms, in particular Tree-Adjoining Grammar (TAG), Multi-Component Tree-Adjoining Grammar with Tree Tuples (TT-MCTAG) and simple Range Concatenation Grammar (RCG). Furthermore, for tree-based grammars, the parser computes not only syntactic analyses but also the corresponding semantic representations
TuLiPA : towards a multi-formalism parsing environment for grammar engineering
In this paper, we present an open-source parsing environment (TĂźbingen Linguistic Parsing Architecture, TuLiPA) which uses Range Concatenation Grammar (RCG) as a pivot formalism, thus opening the way to the parsing of several mildly context-sensitive formalisms. This environment currently supports tree-based grammars (namely Tree-Adjoining Grammars (TAG) and Multi-Component Tree-Adjoining Grammars with Tree Tuples (TT-MCTAG)) and allows computation not only of syntactic structures, but also of the corresponding semantic representations. It is used for the development of a tree-based grammar for German
TuLiPA : towards a multi-formalism parsing environment for grammar engineering
In this paper, we present an open-source parsing environment (TĂźbingen Linguistic Parsing Architecture, TuLiPA) which uses Range Concatenation Grammar (RCG) as a pivot formalism, thus opening the way to the parsing of several mildly context-sensitive formalisms. This environment currently supports tree-based grammars (namely Tree-Adjoining Grammars (TAG) and Multi-Component Tree-Adjoining Grammars with Tree Tuples (TT-MCTAG)) and allows computation not only of syntactic structures, but also of the corresponding semantic representations. It is used for the development of a tree-based grammar for German
Evaluating the Potential of a Large-Scale Polysemy Network as a Model of Plausible Semantic Shifts
We present a very large network of crosslinguistic
polysemies, and compare the notion of semantic
relatedness it encodes to the catalogue of semantic shifts
maintained by the Russian Academy of Sciences. We
separately evaluate all types of semantic shifts featured in
the catalogue, including shifts occurring during semantic
evolution, during borrowing, and during morphological
derivation. The comparison shows that over one third of
the attested semantic shifts take place between close neighbors
in the network. This can be considered strong evidence
for the usefulness of polysemy networks in modelling most
types of lexical change, making them a valuable resource
e.g. for semantic reconstruction or future automatization
of cognate detection. We also show that the semantic
shifts which occur during morphological derivation form a
divergent class, and might need to be modelled separately
Masking Treebanks for the Free Distribution of Linguistic Resources and Other Applications
Proceedings of the Sixth International Workshop on Treebanks and
Linguistic Theories.
Editors: Koenraad De Smedt, Jan HajiÄ and Sandra KĂźbler.
NEALT Proceedings Series, Vol. 1 (2007), 127-138.
Š 2007 The editors and contributors.
Published by
Northern European Association for Language
Technology (NEALT)
http://omilia.uio.no/nealt .
Electronically published at
Tartu University Library (Estonia)
http://hdl.handle.net/10062/4476
Using computational criteria to extract large Swadesh lists for lexicostatistics
We propose a new method for empirically determining lists of basic concepts for the purpose of compiling extensive lexicostatistical databases. The idea is to approximate a notion of âswadeshnessâ formally and reproducibly without
expert knowledge or bias, and being able to rank any number of concepts given enough data. Unlike previous approaches, our procedure indirectly measures both stability of concepts against lexical replacement, and their proneness to phenomena such as onomatopoesia and extensive borrowing. The method provides a fully automated way to generate customized Swadesh lists of any
desired length, possibly adapted to a given geographical region. We apply the method to a large lexical database of Northern Eurasia, deriving a swadeshness ranking for more than 5,000 concepts expressed by German lemmas. We evaluate this ranking against existing shorter lists of basic concepts to validate the method, and give an English version of the 300 top concepts according to this ranking